Project 1 - Inverted Indexer
Assigned: 1/22/08 Due: 1/29/08 at 4:30 PM
Review:
An inverted index is a mapping from words to their locations in a set of documents. Most modern search engines use
some form of inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple
hash table that maps each word in the documents to some sort of document identifier. For example, given the following two documents:
Doc1:
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Doc2:
Buffalo are mammals.
We could construct the following inverted file index:
Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2
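In plain Java (the language you will be using with Hadoop), the same index can be built with an ordinary map from words to sets of document IDs. The class and method names below are illustrative, not part of any required API:

```java
import java.util.*;

public class InvertedIndexDemo {
    // Build a word -> set-of-document-IDs mapping from raw documents.
    // Words are split on whitespace only, so "buffalo." and "buffalo"
    // remain distinct tokens, just as in the example above.
    static Map<String, Set<String>> buildIndex(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("Doc1", "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.");
        docs.put("Doc2", "Buffalo are mammals.");
        System.out.println(buildIndex(docs));
    }
}
```

Running this on the two documents above reproduces the five index entries shown.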
Data Set:
As your final data set, you will use a crawl of Wikipedia articles. The data set is located on the DFS at
/shared/wikipedia (roughly 12 GB of text in 193 files, each roughly 64 MB).
Each Wikipedia article is on its own line. A typical line of the data set (one article) starts as follows:
<page> <title>Critical theory</title>
... (snip) ...
<text xml:space="preserve">In the [[humanities]] and [[social sciences]], '''critical theory'''
has two quite different meanings with different origins and histories, one originating in [[social theory]]
and the other in [[literary criticism]]
... (snip)
Notice that the title of the article is delimited by an XML title tag, and the text of the article is delimited
by an XML text tag. Links to other Wikipedia articles are delimited by "[[" and "]]". Also,
certain links look like this: "[[Longer Real Name|Short Name]]". These links appear in the article as "Short
Name" but link to the article titled "Longer Real Name".
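As a sketch (not a required approach), the title and link targets can be pulled out with regular expressions. The class name and patterns below are one possibility, and they assume each article's markup sits on a single well-formed line as described above:

```java
import java.util.*;
import java.util.regex.*;

public class ArticleParser {
    // Matches the article title between <title> and </title>.
    static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");
    // Matches [[Target]] and [[Target|Display Text]] links; group 1 is
    // the link target (the part before any pipe).
    static final Pattern LINK = Pattern.compile("\\[\\[([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

    // Returns the article title, or null if the line has no title tag.
    static String title(String line) {
        Matcher m = TITLE.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Returns the target titles of all wiki links on the line. For a
    // piped link like [[Longer Real Name|Short Name]] the target is
    // "Longer Real Name", matching the behavior described above.
    static List<String> linkTargets(String line) {
        List<String> targets = new ArrayList<>();
        Matcher m = LINK.matcher(line);
        while (m.find()) targets.add(m.group(1));
        return targets;
    }
}
```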
Part I:
Some words are so common that their presence in an inverted index is "noise" -- they can obfuscate the more
interesting properties of a document. In addition, the data set is full of markup, punctuation, and variations in
capitalization. For the first part of this project, run a word count over an interesting
corpus (e.g. Shakespeare or Wikipedia) to identify noise words. Your goal is to remove these words from your
final output, as well as to "scrub" punctuation and XML markup and normalize the words. One strategy is
multiple MapReduce passes, but you are free to do this however you want. Lastly, include any assumptions you made
about punctuation removal and word normalization in your write-up; extra karma for creating a language-independent
scrubbing algorithm!
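One simple (and decidedly English-only) scrubbing strategy is sketched below. The stop-word list is a placeholder for whatever your word count actually turns up, and stripping everything except ASCII letters and digits is exactly the kind of assumption you would need to document in your write-up:

```java
import java.util.*;

public class Scrubber {
    // Placeholder stop words; your real list should come from your
    // word-count results over the corpus, not be hard-coded like this.
    static final Set<String> STOP =
            new HashSet<>(Arrays.asList("the", "of", "and", "a", "in"));

    // Lowercase a token and strip everything but ASCII letters and
    // digits. Returns null if nothing survives scrubbing or the
    // normalized token is a stop word.
    static String scrub(String token) {
        String w = token.toLowerCase().replaceAll("[^a-z0-9]", "");
        return (w.isEmpty() || STOP.contains(w)) ? null : w;
    }
}
```

Under this scheme "Buffalo" and "buffalo." normalize to the same token, which collapses the five example index entries down to three.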
Part II:
For this portion of the assignment, you will design a MapReduce-based algorithm to calculate the inverted index
over the data. Your final inverted index should not contain the noise words identified in Part I of this project.
Also, the format of your MapReduce output (i.e. the inverted index) must be simple enough to be machine-parseable;
it is not hard to imagine your index being one of many data structures used in a search engine's indexing
pipeline. Lastly, your submitted indexer should run successfully on the Wikipedia corpus. "Successfully"
means it should run to completion without errors or exceptions and generate the correct word -> title mapping.
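The shape of the computation can be sketched in plain Java, with the grouping that Hadoop's shuffle performs for you written out explicitly. The method names and the tab-separated output format below are illustrative only, not a required interface:

```java
import java.util.*;

public class IndexJobSketch {
    // map: for one article, emit a (word, title) pair per token.
    // (A real mapper would scrub tokens and drop noise words first.)
    static List<String[]> map(String title, String text) {
        List<String[]> out = new ArrayList<>();
        for (String tok : text.split("\\s+"))
            out.add(new String[] { tok, title });
        return out;
    }

    // shuffle: group values by key. In Hadoop this grouping happens
    // automatically between the map and reduce phases.
    static Map<String, List<String>> shuffle(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] p : pairs)
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add(p[1]);
        return grouped;
    }

    // reduce: deduplicate titles and format one machine-parseable
    // output line, e.g. "word<TAB>title1,title2".
    static String reduce(String word, List<String> titles) {
        return word + "\t" + String.join(",", new TreeSet<>(titles));
    }
}
```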
Part III:
In addition to the requirements detailed above, implement at least one extension of your choice. The extensions may
be as simple or as complex as you'd like; the primary goal is to give you a chance to play with the MapReduce system.
We have provided some sample extensions that you are free to choose from:
- Write a query program on top of your inverted file index, which will accept a user-specified word (or phrase!)
and return the IDs of the documents which contain that word.
- Instead of creating an inverted file index (which maps words to their document ID), create a full inverted
index (which maps words to their document ID + position in the document). How do you specify a word's position
in the document?
- If you created a full inverted index, write a query program on top of the index which returns not only the
document IDs but also a text "snippet" from each document showing where the query term appears in the document.
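For the full-inverted-index extension, one possible answer to the position question is to record each word's token offset within the article (byte offset into the text is another reasonable choice). The structure and names below are a sketch, not a required design:

```java
import java.util.*;

public class PositionalIndex {
    // Maps word -> (docId -> list of token offsets), where offset 0 is
    // the first whitespace-separated token in the article, offset 1
    // the second, and so on.
    static Map<String, Map<String, List<Integer>>> build(Map<String, String> docs) {
        Map<String, Map<String, List<Integer>>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            String[] toks = doc.getValue().split("\\s+");
            for (int i = 0; i < toks.length; i++) {
                index.computeIfAbsent(toks[i], k -> new TreeMap<>())
                     .computeIfAbsent(doc.getKey(), k -> new ArrayList<>())
                     .add(i);
            }
        }
        return index;
    }
}
```

Token offsets are enough to answer phrase queries (adjacent words have consecutive offsets), while byte offsets make it cheaper for a query program to pull a snippet out of the original text.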
Write-up:
- How long do you think this project would take to finish?
- How much time did you actually spend on this project?
- What noise words did you find as a result of your word count? By what
criteria did you decide where to draw the line? Were you surprised by any
of the words?
- Which extensions did you do? Please detail any interesting results
you found.
- Acknowledge any outside assistance you received from anyone except
assigned course readings and the course staff.
Hints:
- When writing your MapReduce code, test it on small data sets first; this will let you spot problems
much more quickly. One way to obtain a smaller data set is to copy the first few lines of one of the
193 parts of the Wikipedia data and use that as a test set.
- You can change how data is handled when read into your mapper. By default the input format is
"TextInputFormat", which breaks the input into lines; each mapper gets the line as the value and its byte offset in
the file as the key. To change it from the default, the relevant function is "setInputFormat"
on your job configuration in the MapReduce driver. Specifically, consider "KeyValueTextInputFormat".
- You can also control the sorting that happens during the shuffle phase of the MapReduce. Normally all
output from the mapper is treated as text and sorted lexicographically by key; however, you can change the type
of key the mapper outputs. The relevant function is "setMapOutputKeyClass" on your job configuration in the
MapReduce driver. Specifically, consider "IntWritable".
- You can pass parameters to your mappers -- for example, if there is some small amount of data loaded at runtime
that you want to be visible to all of your mappers. See this
link for details.
- The Hadoop API documentation is found at the Hadoop website -- it can be very useful.
- Read the Cluster Notes and use the Job Tracker!
Turn-In instructions:
Please create an archive (preferably zip) that contains:
- All of the source code (MapReduce and otherwise) that you wrote for your project.
- Your write-up in text, doc, or pdf format.
Do not include any of your output, gigabytes of text, grandma's meat loaf recipe, etc.
Turn the archive in to the Catalyst drop box here before 4:30 PM on 1/29/08.